Importing Packages

Load Data

Data Understanding

This is a telecommunications company's customer dataset, containing various demographic and usage information for each customer, as well as whether or not they have churned (i.e. cancelled their service). Here are the meanings of the columns:

Exploratory Data Analysis

Data Overview

Comment

SeniorCitizen is actually a categorical variable, so its 25%-50%-75% quartile distribution is not meaningful

75% of customers have a tenure of less than 55 months

The average monthly charge is USD 64.76, whereas 25% of customers pay more than USD 89.85 per month
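
The point about SeniorCitizen can be illustrated with a small sketch. The rows below are invented for illustration and are not taken from the dataset; only the column names SeniorCitizen, tenure and MonthlyCharges follow the comments above. It shows how a 0/1 flag ends up in the numeric summary, and one possible way to summarise it as a category instead.

```python
import pandas as pd

# Hypothetical mini-sample; values are invented for illustration only.
df = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0, 0, 1, 0],
    "tenure": [1, 34, 2, 45, 8, 72],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65, 89.10],
})

# Stored as 0/1 integers, SeniorCitizen receives quartiles from describe(),
# which are not meaningful for a yes/no flag.
numeric_summary = df.describe()

# Treat the flag as a category and summarise it with counts instead.
df["SeniorCitizen"] = df["SeniorCitizen"].map({0: "No", 1: "Yes"}).astype("category")
senior_counts = df["SeniorCitizen"].value_counts()

print(numeric_summary.loc[["25%", "50%", "75%"]])
print(senior_counts)
```

After the conversion, a plain `df.describe()` covers only tenure and MonthlyCharges.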

Issues with Dataset:

Approach:

Data Cleaning

Cleaning TotalCharges Column
Creating bins for tenure column
Dropping Columns
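
A minimal sketch of these three cleaning steps follows. Several things here are assumptions rather than the notebook's actual choices: that TotalCharges arrives as text with blank entries, that blanks are filled with 0, the 12-month bin edges, and the customerID column name. The sample rows are invented.

```python
import pandas as pd

# Hypothetical rows; customerID is an assumed identifier column.
df = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "tenure": [0, 5, 30, 72],
    "TotalCharges": [" ", "150.5", "1890.0", "7200.25"],
})

# 1. Cleaning TotalCharges: blank strings cannot be parsed, so coerce
#    them to NaN and then fill them (assumption: fill with 0.0).
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df["TotalCharges"] = df["TotalCharges"].fillna(0.0)

# 2. Creating bins for tenure: 12-month groups (assumed edges).
bins = [0, 12, 24, 36, 48, 60, 72]
labels = ["0-12", "13-24", "25-36", "37-48", "49-60", "61-72"]
df["tenure_group"] = pd.cut(df["tenure"], bins=bins, labels=labels, include_lowest=True)

# 3. Dropping columns that carry no predictive information.
df = df.drop(columns=["customerID"])

print(df)
```

Dropping the rows with missing TotalCharges instead of filling them would be an equally reasonable choice when there are only a few of them.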

Questions:

1. What is the most preferred internet service of customers?

2. Understand customer demographics with respect to gender?

4. Are customers without dependents likely to have higher charges than those with dependents?

Hypothesis

Null Hypothesis: Being a senior citizen does not correlate with the tendency to churn
Alternate Hypothesis: Being a senior citizen correlates with the tendency to churn

Analysis

Univariate Analysis

Derived Insight:

HIGH churn is seen for month-to-month contracts, no online security, no tech support, the first year of subscription, and fiber optic internet

LOW churn is seen for long-term contracts, subscriptions without internet service, and customers engaged for 5+ years

Hypothesis Testing
Senior citizens and the tendency to churn

Null Hypothesis: Being a senior citizen does not correlate with the tendency to churn.
Alternate Hypothesis: Being a senior citizen correlates with the tendency to churn.

Insights:
The p-value is less than 0.05, so we reject the null hypothesis. Senior citizens are more likely to churn.
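
The notebook text does not say which test produced the p-value. One common choice for two categorical variables is a chi-square test of independence, sketched below with `scipy.stats.chi2_contingency`. The counts are invented for illustration and are not the dataset's figures. Comparing the churn rate per group gives the direction of the effect, which the test alone does not.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table (rows: senior flag, columns: churn outcome).
observed = pd.DataFrame(
    {"Churn=No": [450, 60], "Churn=Yes": [150, 40]},
    index=["SeniorCitizen=0", "SeniorCitizen=1"],
)

# Chi-square test of independence (Yates correction is applied for 2x2 tables).
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05
reject_null = p_value < alpha

# Churn rate per group shows which group churns more.
churn_rate = observed["Churn=Yes"] / observed.sum(axis=1)

print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}, reject H0: {reject_null}")
print(churn_rate)
```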

Gender and the tendency to churn

Null Hypothesis: Gender does not correlate with the tendency to churn.
Alternate Hypothesis: Gender correlates with the tendency to churn.

  1. Convert all the categorical variables into dummy variables
  1. Relationship between Monthly Charges and Total Charges

Insights:
The p-value is greater than 0.05, so we fail to reject the null hypothesis. There is no evidence that a customer's tendency to churn depends on their gender.

1. What is the most preferred internet service of customers?

2. Understand customer demographics with respect to gender

Insights:

4. Are customers without dependents likely to have higher charges than those with dependents?

Feature Processing - ML

Feature Scaling
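
A minimal sketch of scaling, assuming standardisation with scikit-learn's `StandardScaler` (the notebook text does not name the scaler used). The arrays are invented tenure and monthly-charge values. Fitting on the training rows only and reusing the fitted scaler on the test rows is a design choice that avoids leaking test statistics into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features: [tenure, MonthlyCharges].
X_train = np.array([[1, 29.85], [34, 56.95], [45, 42.30], [72, 89.10]])
X_test = np.array([[2, 53.85], [8, 99.65]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```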

Feature Encoding
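
One way to encode categorical columns is with dummy variables via `pandas.get_dummies`, sketched below. The column names Contract and InternetService and the sample rows are assumptions for illustration. `drop_first=True` drops one level per column to avoid redundant columns, which is optional.

```python
import pandas as pd

# Hypothetical rows with two categorical columns and one numeric column.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "InternetService": ["DSL", "Fiber optic", "No", "Fiber optic"],
    "MonthlyCharges": [29.85, 56.95, 20.05, 99.65],
})

# Replace each categorical column with 0/1 indicator columns.
encoded = pd.get_dummies(
    df, columns=["Contract", "InternetService"], drop_first=True, dtype=int
)

print(encoded)
```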

Data Splitting
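
A sketch of a stratified train/test split on synthetic data. The 80/20 ratio and the 25% churn share are assumptions for illustration. Passing `stratify=y` keeps the class proportions the same in both parts, which matters for an imbalanced target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 25% in the positive (churn) class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array([0] * 150 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test), y_train.mean(), y_test.mean())
```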

Balancing Dataset
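
The balancing method is not stated in the text. As one possible illustration, the sketch below uses random oversampling of the minority class with `sklearn.utils.resample`. The arrays are invented. Only the training split should be balanced, so that the test set keeps the real class proportions.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical training data: 7 non-churners (0) and 3 churners (1).
X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

minority = y_train == 1

# Sample minority rows with replacement until they match the majority count.
X_min_up, y_min_up = resample(
    X_train[minority],
    y_train[minority],
    replace=True,
    n_samples=int((~minority).sum()),
    random_state=42,
)

X_balanced = np.vstack([X_train[~minority], X_min_up])
y_balanced = np.concatenate([y_train[~minority], y_min_up])

print(np.bincount(y_balanced))
```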

Approach:

Modelling

The following models are used:

1. Decision Tree Classifier - Without balancing

Decision Tree Predictions
Decision Tree Classification Report

As you can see, the accuracy is quite low, and because this is an imbalanced dataset, accuracy should not be the metric used to judge the model.

Hence, we need to check recall, precision, and F1 score for the minority class. It is quite evident that all three are too low for Class 1, i.e. churned customers.
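
Why accuracy alone can mislead on imbalanced data is easy to show with a toy case. The labels below are invented: a model that never predicts churn still scores 90% accuracy while finding none of the churned customers.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical labels: 90 retained customers (0) and 10 churned (1).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a model that always predicts "no churn"

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"accuracy={accuracy:.2f}, recall for class 1={minority_recall:.2f}")
# Per-class precision, recall and F1 score in one table.
print(classification_report(y_true, y_pred, zero_division=0))
```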

Decision Tree Classifier with balancing

The results are now noticeably better, i.e. an accuracy of 62% and very good recall, precision, and F1 score for the minority class. Let's try some other classifiers.

2. Random Forest Classifier - Without Balancing

Comment:

Good enough; however, let's check with the balanced dataset

2. Random Forest Classifier - With Balancing

Comment:

After balancing, the F1 score has improved marginally at the expense of accuracy. This is because, prior to balancing, the model was more biased towards the majority class

3. Gradient Boosting Classifier - Without Balancing

Gradient Boosting Classifier with balancing

Comment:

Yes, a far better result with the Gradient Boosting model on the balanced dataset. We can still check more classifiers

4. Logistic Regression - without balancing

Logistic Regression - with balancing

Comment

We can see this is also far better. We now know that the balanced data gives better results.

Model Evaluation
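
A comparison table for the four model families could be assembled along the following lines. This is a sketch on synthetic data from `make_classification`, not the notebook's data or its reported scores, and the default hyperparameters are assumptions. Running it once on the unbalanced and once on the balanced training set would give two tables to compare.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the churn features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model and collect minority-class metrics in one row per model.
rows = []
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    })

results = pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)
print(results.round(3))
```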

Comment:

From the two tables, it can be observed that:

Hyperparameter Tuning

Tuning Gradient Boosting Classifier Model

Tuning the Logistic Regression Model
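
A minimal tuning sketch, assuming a grid search over the regularisation strength `C` with F1 as the selection metric. The grid values, the 5-fold cross-validation, and the synthetic data are all assumptions for illustration; the text does not list the parameters that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the prepared training set.
X, y = make_classification(
    n_samples=300, n_features=6, weights=[0.7, 0.3], random_state=0
)

# Candidate values for the inverse regularisation strength (assumed grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",  # optimise for the minority class rather than accuracy
    cv=5,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the Gradient Boosting model by swapping the estimator and using a grid over parameters such as `n_estimators` or `learning_rate`.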

Exporting Key Components
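
One way to export the fitted components is to bundle them in a dictionary and serialise it with the standard library's `pickle`. The sketch below assumes the key components are a scaler and a model; the file name is hypothetical, and the data is synthetic. Reloading and comparing predictions is a quick check that the export worked.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data and fitted components standing in for the real ones.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
scaler = StandardScaler().fit(X)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

# Bundle everything needed at prediction time into one object.
components = {"scaler": scaler, "model": model}
path = os.path.join(tempfile.mkdtemp(), "churn_components.pkl")  # hypothetical name
with open(path, "wb") as f:
    pickle.dump(components, f)

# Reload and confirm the restored pipeline predicts the same labels.
with open(path, "rb") as f:
    loaded = pickle.load(f)

original_pred = model.predict(scaler.transform(X))
restored_pred = loaded["model"].predict(loaded["scaler"].transform(X))
same = bool((original_pred == restored_pred).all())
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that created them.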